GAN Evaluation Metrics

Objective Evaluation:

The first six metrics below are studied empirically, with released code, in [1].

  1. Inception Score (IS): a classification-based score using an InceptionNet pretrained on ImageNet,

    $$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\big[\mathrm{KL}\big(p_M(y|x)\,\|\,p_M(y)\big)\big]\Big),$$

    in which $p_M(y) = \mathbb{E}_{x \sim p_g}[p_M(y|x)]$ is the marginal distribution of $p_M(y|x)$ over generated samples. Expect $p_M(y|x)$ to be of low entropy (confident per-image predictions) while $p_M(y)$ is of high entropy (diverse classes). The higher, the better. A minimal sketch is given after this list.

  2. Mode score: an extension of the Inception Score that also accounts for the label distribution $p(y^*)$ of real data: $\exp\big(\mathbb{E}_{x \sim p_g}[\mathrm{KL}(p_M(y|x)\,\|\,p(y^*))] - \mathrm{KL}(p_M(y)\,\|\,p(y^*))\big)$. The higher, the better.

  3. Kernel MMD: the maximum mean discrepancy (MMD) between the real and generated data distributions, estimated with a kernel such as the Gaussian. The lower, the better (sketch after this list).

  4. Wasserstein distance: the Wasserstein distance (Earth Mover's Distance, EMD) between the real and generated data distributions. The lower, the better (1-D illustration after this list).

  5. Fréchet Inception Distance (FID): extract InceptionNet features for real and generated images, fit a Gaussian to each set, and measure the Fréchet distance between the two Gaussians. The lower, the better (sketch after this list).

  6. 1-NN score: treat real data as positive and generated data as negative, then compute the leave-one-out (LOO) accuracy of a 1-NN classifier. An accuracy near 50% means the two distributions are hard to tell apart (sketch after this list).

  7. Learned Perceptual Image Patch Similarity (LPIPS): the distance between deep features of two images, calibrated against human perceptual judgments [3]; official code is available (usage sketch after this list). The lower, the better (more perceptually similar).
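
A minimal NumPy sketch of the Inception Score, assuming `probs` already holds the softmax outputs $p_M(y|x)$ of an ImageNet-pretrained Inception network over the generated images (the network forward pass is omitted; the `eps` smoothing is my addition for numerical safety):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) softmax outputs p_M(y|x) for N generated images."""
    p_y = probs.mean(axis=0)  # marginal p_M(y)
    # IS = exp( E_x[ KL(p_M(y|x) || p_M(y)) ] )
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```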
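
A sketch of the (biased) squared kernel MMD estimate with a Gaussian kernel; the bandwidth `sigma` is a free choice here, not a value taken from [1]:

```python
import numpy as np

def _gauss(x, y, sigma):
    """Gaussian kernel matrix between samples x: (N, D) and y: (M, D)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets."""
    return float(_gauss(x, x, sigma).mean()
                 + _gauss(y, y, sigma).mean()
                 - 2 * _gauss(x, y, sigma).mean())
```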
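
For the Wasserstein distance, a toy 1-D illustration via SciPy; this is only the one-dimensional case with made-up samples, whereas estimating the distance for high-dimensional image features, as in [1], requires solving an optimal-transport problem:

```python
import numpy as np
from scipy.stats import wasserstein_distance

real = np.random.normal(0.0, 1.0, size=1000)  # toy 1-D "real" samples
fake = np.random.normal(0.5, 1.2, size=1000)  # toy 1-D "generated" samples
print(wasserstein_distance(real, fake))       # lower = closer distributions
```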
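
A sketch of FID from precomputed Inception features (feature extraction omitted), implementing $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}\big)$:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """feats_*: (N, D) InceptionNet activations for each image set."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))
```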
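
A scikit-learn sketch of the LOO 1-NN accuracy; note it is O(N^2), so it is meant for modest sample sizes:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def one_nn_score(real, fake):
    """LOO 1-NN accuracy; ~0.5 means real and generated data are indistinguishable."""
    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), x, y,
                             cv=LeaveOneOut())
    return float(scores.mean())
```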
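
A usage sketch with the `lpips` package released alongside [3]; images are expected as tensors scaled to $[-1, 1]$ (dummy random images here):

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')        # AlexNet backbone (the package default)
img0 = torch.rand(1, 3, 64, 64) * 2 - 1  # dummy image in [-1, 1]
img1 = torch.rand(1, 3, 64, 64) * 2 - 1
print(loss_fn(img0, img1).item())        # lower = more perceptually similar
```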

Subjective Evaluation:

  1. Each user sees two randomly selected results at a time and is asked to choose the one that looks more realistic. After all pairwise comparisons are collected, the Bradley-Terry model (B-T model) is used to compute a global ranking score for each method [2]. A fitting sketch follows.
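
A sketch of fitting B-T strengths from a pairwise win-count matrix. The fitting procedure here, the classic minorization-maximization (MM) updates, is an assumption on my part; [2] does not specify which fitting method it uses:

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j]: number of times method i was preferred over method j."""
    n = wins.shape[0]
    games = wins + wins.T  # total comparisons per pair (diagonal assumed zero)
    p = np.ones(n)         # latent strength of each method
    for _ in range(iters):
        # MM updates (an assumed fitting procedure, not specified in [2])
        for i in range(n):
            j = games[i] > 0  # opponents method i was compared against
            p[i] = wins[i].sum() / (games[i, j] / (p[i] + p[j])).sum()
        p /= p.sum()          # fix the overall scale
    return p                  # higher = ranked more realistic overall
```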

References

  1. Xu, Qiantong, et al. “An empirical study on evaluation metrics of generative adversarial networks.” arXiv preprint arXiv:1806.07755 (2018).
  2. Tsai, Yi-Hsuan, et al. “Deep image harmonization.” CVPR, 2017.
  3. Zhang, Richard, et al. “The unreasonable effectiveness of deep features as a perceptual metric.” CVPR, 2018.